
Batch reductions #447

Merged: 29 commits merged from the reductions branch into master on Aug 11, 2021
Conversation

@neworderofjamie (Contributor) commented Aug 5, 2021

So the batching system (#392) lets you do parallel inference but, in order to do parallel training, you also need to be able to sum up (reduce) what you're learning online across all elements in the batch and apply the result to the (shared) weights. This PR implements this via some new VarAccess modes, REDUCE_SUM and REDUCE_MAX, which signal that writes to these variables should be reductions. I went backwards and forwards a lot about the syntax for this one but ended up not really adding any new syntax, so a gradient-reduce-and-zero custom update might look like this:

gradient_batch_reduce_model = genn_model.create_custom_custom_update_class(
    "gradient_batch_reduce",
    var_name_types=[("reducedGradient", "scalar", VarAccess_REDUCE_BATCH_SUM)],
    var_refs=[("gradient", "scalar")],
    update_code="""
    $(reducedGradient) = $(gradient);
    $(gradient) = 0;
    """)

The nice thing about this lack of syntax is that you can do things like implement softmax with $(reducedGradient) = exp($(gradient)); (although that doesn't make a lot of sense reducing across batches), and backends which don't support batching (like the single-threaded CPU backend) can basically just stick in a write-back to global memory after the generic code generation, which automatically turns this into an (unnecessary) copy operation.
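For illustration only, here is a minimal sketch (not from the PR) of what such a custom update could look like, reducing exp of a referenced variable across the batch to build a softmax denominator; the model name and the variables sumExp and x are hypothetical:

softmax_denominator_model = genn_model.create_custom_custom_update_class(
    "softmax_denominator",
    # Hypothetical names; the reduction variable accumulates exp(x) across the batch
    var_name_types=[("sumExp", "scalar", VarAccess_REDUCE_BATCH_SUM)],
    var_refs=[("x", "scalar")],
    update_code="""
    $(sumExp) = exp($(x));
    """)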

Because typically NUM_BATCHES << NUM_WEIGHTS, this reduction is quite different from those typically discussed in the literature (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf), so I dug into the TF source to see how they implement reductions of this type and, for NUM_WEIGHTS > 4096, they use this very simple algorithm (sketched in code after the list):

  1. Launch kernel with one thread per weight/neuron
  2. Initialise register to a suitable initial value (hence the 'type system' wrangling so a sensible initial value for REDUCE_MAX can be established)
  3. In each thread loop over the batches and apply the reduction operation
  4. Write the reduced value back to memory.
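To make the access pattern concrete, here's a minimal NumPy sketch of those four steps (purely illustrative Python, not the generated GPU code; batch_reduce_sum is a made-up name):

import numpy as np

def batch_reduce_sum(gradients):
    # gradients has shape (num_batches, num_weights), like a batch-duplicated variable
    num_batches, num_weights = gradients.shape
    reduced = np.empty(num_weights, dtype=gradients.dtype)
    for w in range(num_weights):      # 1. one "thread" per weight/neuron
        acc = 0.0                     # 2. suitable initial value (0 for a sum reduction)
        for b in range(num_batches):  # 3. loop over the batches...
            acc += gradients[b, w]    #    ...applying the reduction operation
        reduced[w] = acc              # 4. write the reduced value back to memory
    return reduced

On the GPU, consecutive threads read consecutive weights within each batch, which is what gives the coalesced global memory reads mentioned in the next paragraph.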

This makes sense as you get good coalescing of global memory reads and no need for atomics etc. And, as GeNN will fuse any compatible reductions together so that they run in parallel, I think any reasonable model will easily occupy the GPU (which I think the 4096 threshold vaguely represents).

I've been using this to do parallel eProp where one of these reductions on the gradients is followed by an Adam optimizer custom update which applies the now-non-batched gradients to the shared weights (via #446). However, it's pretty flexible so you could actually use it with STDP or whatever - you'd apply your STDP rule to a deltaG variable which would be duplicated across the batches, reduce these and add them to the (shared) weights.
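As a rough sketch of how that wiring could look from PyGeNN (assuming a GeNNModel called model and a synapse group syn with a batch-duplicated "gradient" weight update variable; the group name and exact signatures here are from memory, so treat them as illustrative rather than definitive):

from pygenn.genn_model import create_wu_var_ref

# Attach the reduction custom update defined above to the synapse group's
# batch-duplicated gradient variable (names are illustrative)
model.add_custom_update(
    "gradient_reduce", "GradientBatchReduce", gradient_batch_reduce_model,
    {}, {"reducedGradient": 0.0},
    {"gradient": create_wu_var_ref(syn, "gradient")})

# ...then, e.g. at the end of each batch, launch the group; an optimizer
# custom update referencing "reducedGradient" would go in a later group
model.custom_update("GradientBatchReduce")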

On the Titan V, increasing the batch size decreases the effective time to train a single stimulus by around 4.5x, as shown:
[Figure: stimulus time (ms) vs batch size]

neworderofjamie and others added 27 commits July 20, 2021 15:53
@neworderofjamie neworderofjamie added this to the GeNN 4.6.0 milestone Aug 5, 2021
@neworderofjamie neworderofjamie marked this pull request as ready for review August 5, 2021 18:05
@codecov bot commented Aug 5, 2021

Codecov Report

Merging #447 (8bd66ac) into master (78faf0d) will increase coverage by 0.07%.
The diff coverage is 92.20%.


@@            Coverage Diff             @@
##           master     #447      +/-   ##
==========================================
+ Coverage   88.00%   88.08%   +0.07%     
==========================================
  Files          78       78              
  Lines       16605    16824     +219     
==========================================
+ Hits        14614    14820     +206     
- Misses       1991     2004      +13     
Impacted Files Coverage Δ
include/genn/genn/code_generator/codeGenUtils.h 98.07% <ø> (ø)
include/genn/genn/code_generator/groupMerged.h 89.68% <ø> (ø)
include/genn/genn/currentSource.h 77.77% <ø> (+19.44%) ⬆️
include/genn/genn/currentSourceModels.h 50.00% <ø> (ø)
include/genn/genn/customUpdateInternal.h 0.00% <ø> (ø)
include/genn/genn/customUpdateModels.h 50.00% <ø> (ø)
include/genn/genn/gennUtils.h 100.00% <ø> (ø)
include/genn/genn/neuronGroup.h 72.72% <ø> (-1.42%) ⬇️
include/genn/genn/postsynapticModels.h 100.00% <ø> (ø)
src/genn/backends/single_threaded_cpu/backend.cc 56.94% <37.50%> (-0.16%) ⬇️
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tnowotny (Member) left a comment:

Other than the comment about not initialising reduction-type variables below, it all makes sense ...

// Loop through variable references
for(const auto &v : cm->getVarRefs()) {
    // If variable reference is a reduction target, define variable initialised to correct initial value for reduction
    // **NOTE** by not initialising this, compilers should emit a warning if user code doesn't set it to something
@tnowotny (Member) commented:

uhm ... I am not quite sure I understand what is going on here. Is this an old comment (you seem to be initialising below after all) or am I missing the entire plot?

@neworderofjamie (Contributor, Author) replied:

Good spot - I've moved the comment to where it belongs.

@neworderofjamie neworderofjamie merged commit 36d3e83 into master Aug 11, 2021
@neworderofjamie neworderofjamie deleted the reductions branch August 11, 2021 12:54
@neworderofjamie neworderofjamie mentioned this pull request Oct 3, 2022